Principal Component Analysis (PCA)¶


1. What is PCA?¶

PCA is a dimensionality reduction technique that transforms a large set of variables (features) into a smaller one that still contains most of the information (variance) in the large set.

  • It does this by projecting the data onto a new set of orthogonal axes (principal components), ordered by how much variance they explain.
  • PCA is unsupervised — it ignores the target variable y.

2. Why Use PCA?¶

  • High Dimensionality: too many input features (columns) lead to the "curse of dimensionality"; the data becomes sparse, training slows down, and models overfit.
  • Redundancy: many features are correlated (e.g., height and arm length); PCA removes multicollinearity.
  • Noise Reduction: PCA filters out noise by keeping only the most informative components.
  • Visualization: helps visualize high-dimensional data in 2D or 3D (e.g., PC1 vs PC2).
  • Preprocessing: makes downstream ML tasks such as clustering or classification more efficient.

3. How Does PCA Work? (Step-by-Step)¶

  1. Standardize the Data

    • Mean = 0, Variance = 1 for all features
    • PCA is sensitive to feature scales.
  2. Compute the Covariance Matrix

    • Measures how features vary with each other.
  3. Calculate Eigenvalues and Eigenvectors

    • Eigenvectors → Principal Components (PCs)
    • Eigenvalues → Importance (variance explained) by each PC
  4. Sort and Select Top-k Components

    • Sort PCs by eigenvalues in descending order
    • Choose top k components based on desired variance (e.g., 95%)
  5. Project Original Data onto New k-Dimensional Space
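The steps above can be sketched with plain NumPy (an illustrative sketch on random toy data; sklearn's PCA performs the equivalent computation internally via SVD):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # toy data: 200 samples, 5 features

# 1. Standardize: mean 0, variance 1 per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized features (5 x 5)
cov = np.cov(X_std, rowvar=False)

# 3. Eigen-decomposition (eigh, since covariance matrices are symmetric)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort by eigenvalue in descending order and keep the top k
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
k = 2
W = eigenvectors[:, :k]

# 5. Project the standardized data onto the k-dimensional space
X_pca = X_std @ W
print(X_pca.shape)                     # (200, 2)
```

The eigenvalues divided by their sum give the explained variance ratio for each component.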


4. Important Pointers¶

  • Variance Explained: you can retain 95–99% of the variance with far fewer components.
  • PCs are Orthogonal: there is no correlation between principal components.
  • Linear Method: PCA only captures linear patterns (no curves or complex boundaries).
  • Unsupervised: it does not use the target y; PCA is based purely on X.
  • Lossy Transformation: the original features cannot be perfectly reconstructed from the retained PCs.
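Two of these pointers (orthogonality and lossiness) are easy to verify with sklearn; a small sketch on random data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 6))          # toy data: 100 samples, 6 features

pca = PCA(n_components=3).fit(X)
X_reduced = pca.transform(X)

# PCs are orthogonal: the components matrix has orthonormal rows
print(np.allclose(pca.components_ @ pca.components_.T, np.eye(3)))  # True

# Lossy: inverse_transform cannot rebuild X exactly from 3 of 6 PCs
X_back = pca.inverse_transform(X_reduced)
print(np.allclose(X_back, X))                                       # False
```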

5. Advantages of PCA¶

  • Reduces Overfitting – Fewer redundant features
  • Faster Computation – Smaller data = quicker training
  • Better Generalization – Removes noise
  • Visual Insights – PCA helps explore patterns in 2D plots
  • Removes Multicollinearity – Helpful before regression

6. Disadvantages of PCA¶

  • Loss of Interpretability – You lose original feature meaning (e.g., PC1 = 0.4X1 + 0.6X2...)
  • Only Linear – Cannot capture non-linear relationships
  • Scaling Required – Sensitive to feature magnitudes
  • Doesn’t Consider Target – Might reduce features important for prediction
  • Not Good for Sparse Data – In text mining (like TF-IDF), PCA may not preserve important word features

7. Corner Cases / Gotchas¶

  • Features on Different Scales: PCA performs poorly without standardization; always scale features first.
  • Too Few Samples: if samples < features (e.g., gene expression data), PCA may overfit.
  • Missing Values: PCA cannot handle missing values; impute or drop them first.
  • Categorical Variables: PCA only works with numeric data; encode categoricals appropriately.
  • Highly Non-linear Data: use Kernel PCA or t-SNE instead for curved manifolds.
  • Applying PCA After the Train-Test Split: always fit PCA on X_train and only transform X_test, or you will leak information from the test data.

8. When to Use PCA? (Ideal Scenarios)¶

  • Too Many Features: Yes; e.g., 100+ columns in a CSV.
  • Multicollinearity Present: Yes; good before linear regression.
  • Noise in Data: Yes; PCA can filter noise.
  • Data Visualization: Yes; use the top 2–3 PCs for plots.
  • Sparse or Text Data: No; use TruncatedSVD or LSA instead.
  • You Need Interpretability: No; use feature selection instead.
  • Target is Important: No; use supervised methods such as LDA (Linear Discriminant Analysis).
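For the sparse/text case, TruncatedSVD works directly on a sparse TF-IDF matrix without densifying it. A minimal sketch (the documents and component count are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the gym is crowded", "the gym is empty",
        "students love the gym", "holidays are quiet"]

X_tfidf = TfidfVectorizer().fit_transform(docs)   # sparse TF-IDF matrix
svd = TruncatedSVD(n_components=2, random_state=0)
X_lsa = svd.fit_transform(X_tfidf)                # dense, shape (4, 2)
print(X_lsa.shape)
```

Unlike PCA, TruncatedSVD does not center the data, which is what lets it operate on sparse input.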

9. Choosing Number of Components (k)¶

Use the explained variance ratio:

from sklearn.decomposition import PCA
import numpy as np

pca = PCA().fit(X_scaled)
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)

# Smallest k such that cumulative explained variance >= 95%
k = np.argmax(cumulative_variance >= 0.95) + 1

Plot the cumulative explained variance against the number of components to find the elbow point.


10. Best Practices¶

  • Standardize the features (e.g., StandardScaler)
  • Use PCA only on numeric features
  • Retain 95–99% variance (tunable)
  • Always apply PCA after train-test split, not before
  • Keep a copy of PCA object to transform future data
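Several of these practices can be bundled into a single sklearn Pipeline, which guarantees that the scaler and PCA are fit on the training data only and keeps the fitted objects around for transforming future data. A sketch on synthetic toy data (the data and regressor choice are illustrative):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),          # standardize first
    ("pca", PCA(n_components=0.95)),      # retain 95% of the variance
    ("reg", LinearRegression()),
])
pipe.fit(X_train, y_train)                # scaler and PCA fit on train only
print(pipe.score(X_test, y_test))         # R^2 on held-out data
```

Calling `pipe.predict` on new data automatically applies the stored scaling and projection before the regression, so there is no risk of fitting the transforms on test data.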

11. Alternative Techniques¶

  • LDA: you want dimensionality reduction plus classification (uses the target y).
  • t-SNE / UMAP: for non-linear visualization.
  • Feature Selection: when interpretability is critical.
  • Autoencoders: for non-linear compression using neural networks.

In [1]:
# PCA Implementation

# Let's do the necessary imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")
In [2]:
df = pd.read_csv('data.csv')
df.head()
Out[2]:
number_people date timestamp day_of_week is_weekend is_holiday temperature is_start_of_semester is_during_semester month hour
0 37 2015-08-14 17:00:11-07:00 61211 4 0 0 71.76 0 0 8 17
1 45 2015-08-14 17:20:14-07:00 62414 4 0 0 71.76 0 0 8 17
2 40 2015-08-14 17:30:15-07:00 63015 4 0 0 71.76 0 0 8 17
3 44 2015-08-14 17:40:16-07:00 63616 4 0 0 71.76 0 0 8 17
4 45 2015-08-14 17:50:17-07:00 64217 4 0 0 71.76 0 0 8 17
In [3]:
df.columns
Out[3]:
Index(['number_people', 'date', 'timestamp', 'day_of_week', 'is_weekend',
       'is_holiday', 'temperature', 'is_start_of_semester',
       'is_during_semester', 'month', 'hour'],
      dtype='object')
In [4]:
# Problem Statement:
# Crowdedness at the Campus Gym using PCA
# Data Description for columns: 'number_people', 'date', 'timestamp', 'day_of_week', 'is_weekend', 'is_holiday', 'temperature', 'is_start_of_semester', 'is_during_semester', 'month', 'hour'

# number_people: Number of students present in the gym at a given time  
# date: Date of the observation in YYYY-MM-DD format
# timestamp: Time of the observation
# day_of_week: Day of the week (0=Monday, 6=Sunday)
# is_weekend: Boolean indicating if the observation is on a weekend (Saturday or Sunday)
# is_holiday: Boolean indicating if the observation is on a holiday
# temperature: Temperature in degrees Fahrenheit at the time of observation (converted to Celsius in a later cell)
# is_start_of_semester: Boolean indicating if the observation is during the start of a semester
# is_during_semester: Boolean indicating if the observation is during an active semester
# month: Month of the observation (1=January, 12=December)
# hour: Hour of the day (0-23) when the observation was made
In [5]:
df.describe()
Out[5]:
number_people timestamp day_of_week is_weekend is_holiday temperature is_start_of_semester is_during_semester month hour
count 62184.000000 62184.000000 62184.000000 62184.000000 62184.000000 62184.000000 62184.000000 62184.000000 62184.000000 62184.000000
mean 29.072543 45799.437958 2.982504 0.282870 0.002573 58.557108 0.078831 0.660218 7.439824 12.236460
std 22.689026 24211.275891 1.996825 0.450398 0.050660 6.316396 0.269476 0.473639 3.445069 6.717631
min 0.000000 0.000000 0.000000 0.000000 0.000000 38.140000 0.000000 0.000000 1.000000 0.000000
25% 9.000000 26624.000000 1.000000 0.000000 0.000000 55.000000 0.000000 0.000000 5.000000 7.000000
50% 28.000000 46522.500000 3.000000 0.000000 0.000000 58.340000 0.000000 1.000000 8.000000 12.000000
75% 43.000000 66612.000000 5.000000 1.000000 0.000000 62.280000 0.000000 1.000000 10.000000 18.000000
max 145.000000 86399.000000 6.000000 1.000000 1.000000 87.170000 1.000000 1.000000 12.000000 23.000000
In [6]:
df.shape
Out[6]:
(62184, 11)
In [7]:
df.corr(numeric_only=True)
Out[7]:
number_people timestamp day_of_week is_weekend is_holiday temperature is_start_of_semester is_during_semester month hour
number_people 1.000000 0.550218 -0.162062 -0.173958 -0.048249 0.373327 0.182683 0.335350 -0.097854 0.552049
timestamp 0.550218 1.000000 -0.001793 -0.000509 0.002851 0.184849 0.009551 0.044676 -0.023221 0.999077
day_of_week -0.162062 -0.001793 1.000000 0.791338 -0.075862 0.011169 -0.011782 -0.004824 0.015559 -0.001914
is_weekend -0.173958 -0.000509 0.791338 1.000000 -0.031899 0.020673 -0.016646 -0.036127 0.008462 -0.000517
is_holiday -0.048249 0.002851 -0.075862 -0.031899 1.000000 -0.088527 -0.014858 -0.070798 -0.094942 0.002843
temperature 0.373327 0.184849 0.011169 0.020673 -0.088527 1.000000 0.093242 0.152476 0.063125 0.185121
is_start_of_semester 0.182683 0.009551 -0.011782 -0.016646 -0.014858 0.093242 1.000000 0.209862 -0.137160 0.010091
is_during_semester 0.335350 0.044676 -0.004824 -0.036127 -0.070798 0.152476 0.209862 1.000000 0.096556 0.045581
month -0.097854 -0.023221 0.015559 0.008462 -0.094942 0.063125 -0.137160 0.096556 1.000000 -0.023624
hour 0.552049 0.999077 -0.001914 -0.000517 0.002843 0.185121 0.010091 0.045581 -0.023624 1.000000
In [8]:
# The temperature given here is in Fahrenheit. We will convert it to Celsius using the formula Celsius = (Fahrenheit - 32) * (5/9)
df['temperature'] = (df['temperature'] - 32) * (5 / 9)
df.head()
# Thus we have converted the temperature column from Fahrenheit to degrees Celsius.
Out[8]:
number_people date timestamp day_of_week is_weekend is_holiday temperature is_start_of_semester is_during_semester month hour
0 37 2015-08-14 17:00:11-07:00 61211 4 0 0 22.088889 0 0 8 17
1 45 2015-08-14 17:20:14-07:00 62414 4 0 0 22.088889 0 0 8 17
2 40 2015-08-14 17:30:15-07:00 63015 4 0 0 22.088889 0 0 8 17
3 44 2015-08-14 17:40:16-07:00 63616 4 0 0 22.088889 0 0 8 17
4 45 2015-08-14 17:50:17-07:00 64217 4 0 0 22.088889 0 0 8 17
In [9]:
X = df.iloc[:,1:]  # all rows, all the features and no labels
y = df.iloc[:, 0]  # all rows, label only

# Problem Statement:
# Crowdedness at the Campus Gym using PCA
# y - number_people: Number of students present in the gym at a given time
# Therefore, we will reduce the features with PCA and then predict the number of people in the gym with a regression model.
In [10]:
X.head()
Out[10]:
date timestamp day_of_week is_weekend is_holiday temperature is_start_of_semester is_during_semester month hour
0 2015-08-14 17:00:11-07:00 61211 4 0 0 22.088889 0 0 8 17
1 2015-08-14 17:20:14-07:00 62414 4 0 0 22.088889 0 0 8 17
2 2015-08-14 17:30:15-07:00 63015 4 0 0 22.088889 0 0 8 17
3 2015-08-14 17:40:16-07:00 63616 4 0 0 22.088889 0 0 8 17
4 2015-08-14 17:50:17-07:00 64217 4 0 0 22.088889 0 0 8 17
In [11]:
y.head()
Out[11]:
0    37
1    45
2    40
3    44
4    45
Name: number_people, dtype: int64
In [12]:
correlation = df.corr(numeric_only=True)
plt.figure(figsize=(10,10))
sns.heatmap(correlation, vmax=1, square=True,annot=True,cmap='viridis')
plt.title('Correlation between different features')
plt.show()
In [13]:
X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62184 entries, 0 to 62183
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   date                  62184 non-null  object 
 1   timestamp             62184 non-null  int64  
 2   day_of_week           62184 non-null  int64  
 3   is_weekend            62184 non-null  int64  
 4   is_holiday            62184 non-null  int64  
 5   temperature           62184 non-null  float64
 6   is_start_of_semester  62184 non-null  int64  
 7   is_during_semester    62184 non-null  int64  
 8   month                 62184 non-null  int64  
 9   hour                  62184 non-null  int64  
dtypes: float64(1), int64(8), object(1)
memory usage: 4.7+ MB
In [14]:
X.drop('date',axis=1,inplace=True)
X.columns
Out[14]:
Index(['timestamp', 'day_of_week', 'is_weekend', 'is_holiday', 'temperature',
       'is_start_of_semester', 'is_during_semester', 'month', 'hour'],
      dtype='object')
In [15]:
X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62184 entries, 0 to 62183
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   timestamp             62184 non-null  int64  
 1   day_of_week           62184 non-null  int64  
 2   is_weekend            62184 non-null  int64  
 3   is_holiday            62184 non-null  int64  
 4   temperature           62184 non-null  float64
 5   is_start_of_semester  62184 non-null  int64  
 6   is_during_semester    62184 non-null  int64  
 7   month                 62184 non-null  int64  
 8   hour                  62184 non-null  int64  
dtypes: float64(1), int64(8)
memory usage: 4.3 MB
In [16]:
# Train-Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the shape of the training and testing sets
print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)
print("Training labels shape:", y_train.shape)
print("Testing labels shape:", y_test.shape)
Training set shape: (49747, 9)
Testing set shape: (12437, 9)
Training labels shape: (49747,)
Testing labels shape: (12437,)
In [17]:
# Apply StandardScaler to the features
features = X_train.columns.tolist()
# features

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train[features] = sc.fit_transform(X_train[features])
X_test[features] = sc.transform(X_test[features])
In [18]:
X_train.shape
Out[18]:
(49747, 9)
In [19]:
# PCA
# https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

# class sklearn.decomposition.PCA(n_components=None, *, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', n_oversamples=10, power_iteration_normalizer='auto', random_state=None)

from sklearn.decomposition import PCA
pca = PCA()
# Fit PCA on the training data
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

# Explained Variance Ratio: It returns the percentage of variance explained by each of the selected components.
explained_variance = pca.explained_variance_ratio_
print("The principal components explain the following variance:")
for i, var in enumerate(explained_variance):
    print(f"Principal Component {i+1}: {var:.4f}")
The principal components explain the following variance:
Principal Component 1: 0.2305
Principal Component 2: 0.2003
Principal Component 3: 0.1456
Principal Component 4: 0.1287
Principal Component 5: 0.1019
Principal Component 6: 0.0928
Principal Component 7: 0.0773
Principal Component 8: 0.0229
Principal Component 9: 0.0001
In [20]:
# lets plot the explained variance ratio
plt.figure(figsize=(10, 6))
plt.bar(range(1, len(explained_variance) + 1), explained_variance, alpha=0.7, align='center')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.title('Explained Variance Ratio by Principal Components')
plt.xticks(range(1, len(explained_variance) + 1))
plt.show()
In [21]:
# Lets do the modeling using Random Forest Regressor

from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()

# Fit the model on the training data
model.fit(X_train_pca, y_train)

# Predict on the train and test data
y_train_pred = model.predict(X_train_pca)
y_test_pred = model.predict(X_test_pca)

# Evaluate the model - r2, RMSE
from sklearn.metrics import r2_score, mean_squared_error
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
print(f"Training R^2: {train_r2:.4f}")
print(f"Testing R^2: {test_r2:.4f}")
print(f"Training RMSE: {train_rmse:.4f}")
print(f"Testing RMSE: {test_rmse:.4f}")
Training R^2: 0.9861
Testing R^2: 0.9007
Training RMSE: 2.6718
Testing RMSE: 7.1626
In [ ]:
# Install plotly
# %pip install --upgrade pip -q
# %pip install plotly -q
Note: you may need to restart the kernel to use updated packages.
In [28]:
print(explained_variance,"\n\n")
print(np.cumsum(explained_variance))
[2.30519231e-01 2.00263856e-01 1.45583261e-01 1.28687653e-01
 1.01905642e-01 9.27520152e-02 7.73183498e-02 2.28677834e-02
 1.02208707e-04] 


[0.23051923 0.43078309 0.57636635 0.705054   0.80695964 0.89971166
 0.97703001 0.99989779 1.        ]
In [24]:
# Lets plot the cumulative explained variance using plotly
import plotly.express as px
cumulative_variance = np.cumsum(explained_variance)
fig = px.line(x=range(1, len(cumulative_variance) + 1), y=cumulative_variance, labels={'x': 'Principal Components', 'y': 'Cumulative Explained Variance'},
              title='Cumulative Explained Variance by Principal Components')
fig.update_layout(xaxis=dict(tickmode='linear', tick0=1, dtick=1))
fig.show()
# This plot shows how much variance is explained as we add more principal components, helping us decide how many components to keep for further analysis or modeling.
# The cumulative explained variance plot is useful for determining the number of principal components to retain in PCA
# based on the desired level of explained variance.
In [34]:
# Lets apply PCA with 2 components
pca_2 = PCA(n_components=2)
X_train_pca_2 = pca_2.fit_transform(X_train)
X_test_pca_2 = pca_2.transform(X_test)

# Now we can visualize the data in 2D using the first two principal components
plt.figure(figsize=(10, 6))
plt.scatter(X_train_pca_2[:, 0], X_train_pca_2[:, 1], c=y_train, cmap='viridis', edgecolor='k', s=50)
plt.colorbar(label='Number of People')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA: 2D Visualization of Training Data')
plt.show()
# The scatter plot shows the distribution of the training data in the new PCA space, where each point represents a sample, and the color indicates the number of people present in the gym.

# Lets visualize the explained variance ratio for the first two principal components
plt.figure(figsize=(10, 6))
plt.bar(range(1, 3), explained_variance[:2], alpha=0.7, align='center')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.title('Explained Variance Ratio for First Two Principal Components')
plt.xticks(range(1, 3))
plt.show()
# The bar plot shows the explained variance ratio for the first two principal components, indicating how much variance each component captures in the data.

# Lets visualize the cumulative explained variance for the first two principal components using plotly
cumulative_variance_2 = np.cumsum(explained_variance[:2])
fig_2 = px.line(x=range(1, 3), y=cumulative_variance_2, labels={'x': 'Principal Components', 'y': 'Cumulative Explained Variance'},
                title='Cumulative Explained Variance for First Two Principal Components')
fig_2.update_layout(xaxis=dict(tickmode='linear', tick0=1, dtick=1))
fig_2.show()
# The cumulative explained variance plot for the first two principal components shows how much total variance is explained
# when considering both components, helping us understand the overall information captured by these two dimensions.

# Lets apply Random Forest Regressor on the PCA transformed data with 2 components
model_2 = RandomForestRegressor()
# Fit the model on the training data with 2 PCA components
model_2.fit(X_train_pca_2, y_train)

# Predict on the train and test data with 2 PCA components
y_train_pred_2 = model_2.predict(X_train_pca_2)
y_test_pred_2 = model_2.predict(X_test_pca_2)

# Evaluate the model with 2 PCA components - r2, RMSE
train_r2_2 = r2_score(y_train, y_train_pred_2)
test_r2_2 = r2_score(y_test, y_test_pred_2)
train_rmse_2 = np.sqrt(mean_squared_error(y_train, y_train_pred_2))
test_rmse_2 = np.sqrt(mean_squared_error(y_test, y_test_pred_2))
print(f"Training R^2 with 2 PCA components: {train_r2_2:.4f}")
print(f"Testing R^2 with 2 PCA components: {test_r2_2:.4f}")
print(f"Training RMSE with 2 PCA components: {train_rmse_2:.4f}")
print(f"Testing RMSE with 2 PCA components: {test_rmse_2:.4f}")
Training R^2 with 2 PCA components: 0.9750
Testing R^2 with 2 PCA components: 0.8343
Training RMSE with 2 PCA components: 3.5850
Testing RMSE with 2 PCA components: 9.2530
In [39]:
# Lets apply PCA with 3 components
pca_3 = PCA(n_components=3)
X_train_pca_3 = pca_3.fit_transform(X_train)
X_test_pca_3 = pca_3.transform(X_test)

# Now we can visualize the data in 3D using the first three principal components
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
scatter = ax.scatter(X_train_pca_3[:, 0], X_train_pca_3[:, 1], X_train_pca_3[:, 2], c=y_train, cmap='viridis', edgecolor='k', s=50)
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_zlabel('Principal Component 3')
ax.set_title('PCA: 3D Visualization of Training Data')
fig.colorbar(scatter, label='Number of People')
plt.show()
# The 3D scatter plot shows the distribution of the training data in the new PCA space, where each point represents a sample, and the color indicates the number of people present in the gym.

# Lets visualize the explained variance ratio for the first three principal components
plt.figure(figsize=(10, 6))
plt.bar(range(1, 4), explained_variance[:3], alpha=0.7, align='center')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.title('Explained Variance Ratio for First Three Principal Components')
plt.xticks(range(1, 4))
plt.show()
# The bar plot shows the explained variance ratio for the first three principal components, indicating how much variance each component captures in the data.

# Lets visualize the cumulative explained variance for the first three principal components using plotly
cumulative_variance_3 = np.cumsum(explained_variance[:3])
fig_3 = px.line(x=range(1, 4), y=cumulative_variance_3, labels={'x': 'Principal Components', 'y': 'Cumulative Explained Variance'},
                title='Cumulative Explained Variance for First Three Principal Components')
fig_3.update_layout(xaxis=dict(tickmode='linear', tick0=1, dtick=1))
fig_3.show()
# The cumulative explained variance plot for the first three principal components shows how much total variance is explained
# when considering all three components, helping us understand the overall information captured by these three dimensions.

# Lets apply Random Forest Regressor on the PCA transformed data with 3 components
model_3 = RandomForestRegressor()
# Fit the model on the training data with 3 PCA components
model_3.fit(X_train_pca_3, y_train)

# Predict on the train and test data with 3 PCA components
y_train_pred_3 = model_3.predict(X_train_pca_3)
y_test_pred_3 = model_3.predict(X_test_pca_3)

# Evaluate the model with 3 PCA components - r2, RMSE
train_r2_3 = r2_score(y_train, y_train_pred_3)
test_r2_3 = r2_score(y_test, y_test_pred_3)
train_rmse_3 = np.sqrt(mean_squared_error(y_train, y_train_pred_3))
test_rmse_3 = np.sqrt(mean_squared_error(y_test, y_test_pred_3))
print(f"Training R^2 with 3 PCA components: {train_r2_3:.4f}")
print(f"Testing R^2 with 3 PCA components: {test_r2_3:.4f}")
print(f"Training RMSE with 3 PCA components: {train_rmse_3:.4f}")
print(f"Testing RMSE with 3 PCA components: {test_rmse_3:.4f}")
Training R^2 with 3 PCA components: 0.9852
Testing R^2 with 3 PCA components: 0.8995
Training RMSE with 3 PCA components: 2.7578
Testing RMSE with 3 PCA components: 7.2088
In [43]:
# Lets apply PCA with 4 components
pca_4 = PCA(n_components=4)
X_train_pca_4 = pca_4.fit_transform(X_train)
X_test_pca_4 = pca_4.transform(X_test)

# Since it is 4D, we won't go for visualization.

# Lets plot the explained variance ratio for the first four principal components
plt.figure(figsize=(10, 6))
plt.bar(range(1, 5), explained_variance[:4], alpha=0.7, align='center')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.title('Explained Variance Ratio for First Four Principal Components')
plt.xticks(range(1, 5))
plt.show()

# Lets visualize the cumulative explained variance for the first four principal components using plotly
cumulative_variance_4 = np.cumsum(explained_variance[:4])
fig_4 = px.line(x=range(1, 5), y=cumulative_variance_4, labels={'x': 'Principal Components', 'y': 'Cumulative Explained Variance'},
                title='Cumulative Explained Variance for First Four Principal Components')
fig_4.update_layout(xaxis=dict(tickmode='linear', tick0=1, dtick=1))
fig_4.show()
# The cumulative explained variance plot for the first four principal components shows how much total variance is explained
# when considering all four components, helping us understand the overall information captured by these four dimensions.

# Lets apply Random Forest Regressor on the PCA transformed data with 4 components
model_4 = RandomForestRegressor()
# Fit the model on the training data with 4 PCA components
model_4.fit(X_train_pca_4, y_train)

# Predict on the train and test data with 4 PCA components
y_train_pred_4 = model_4.predict(X_train_pca_4)
y_test_pred_4 = model_4.predict(X_test_pca_4)

# Evaluate the model with 4 PCA components - r2, RMSE
train_r2_4 = r2_score(y_train, y_train_pred_4)
test_r2_4 = r2_score(y_test, y_test_pred_4)
train_rmse_4 = np.sqrt(mean_squared_error(y_train, y_train_pred_4))
test_rmse_4 = np.sqrt(mean_squared_error(y_test, y_test_pred_4))
print(f"Training R^2 with 4 PCA components: {train_r2_4:.4f}")
print(f"Testing R^2 with 4 PCA components: {test_r2_4:.4f}")
print(f"Training RMSE with 4 PCA components: {train_rmse_4:.4f}")
print(f"Testing RMSE with 4 PCA components: {test_rmse_4:.4f}")
Training R^2 with 4 PCA components: 0.9884
Testing R^2 with 4 PCA components: 0.9211
Training RMSE with 4 PCA components: 2.4465
Testing RMSE with 4 PCA components: 6.3857
In [44]:
# Lets apply PCA with 5 components

pca_5 = PCA(n_components=5)
X_train_pca_5 = pca_5.fit_transform(X_train)
X_test_pca_5 = pca_5.transform(X_test)

# Since it is 5D, we won't go for visualization.

# Lets plot the explained variance ratio for the first five principal components
plt.figure(figsize=(10, 6))
plt.bar(range(1, 6), explained_variance[:5], alpha=0.7, align='center')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.title('Explained Variance Ratio for First Five Principal Components')
plt.xticks(range(1, 6))
plt.show()

# Lets visualize the cumulative explained variance for the first five principal components using plotly
cumulative_variance_5 = np.cumsum(explained_variance[:5])
fig_5 = px.line(x=range(1, 6), y=cumulative_variance_5, labels={'x': 'Principal Components', 'y': 'Cumulative Explained Variance'},
                title='Cumulative Explained Variance for First Five Principal Components')
fig_5.update_layout(xaxis=dict(tickmode='linear', tick0=1, dtick=1))
fig_5.show()
# The cumulative explained variance plot for the first five principal components shows how much total variance is explained
# when considering all five components, helping us understand the overall information captured by these five dimensions.

# Lets apply Random Forest Regressor on the PCA transformed data with 5 components
model_5 = RandomForestRegressor()
# Fit the model on the training data with 5 PCA components
model_5.fit(X_train_pca_5, y_train)

# Predict on the train and test data with 5 PCA components
y_train_pred_5 = model_5.predict(X_train_pca_5)
y_test_pred_5 = model_5.predict(X_test_pca_5)

# Evaluate the model with 5 PCA components - r2, RMSE
train_r2_5 = r2_score(y_train, y_train_pred_5)
test_r2_5 = r2_score(y_test, y_test_pred_5)
train_rmse_5 = np.sqrt(mean_squared_error(y_train, y_train_pred_5))
test_rmse_5 = np.sqrt(mean_squared_error(y_test, y_test_pred_5))
print(f"Training R^2 with 5 PCA components: {train_r2_5:.4f}")
print(f"Testing R^2 with 5 PCA components: {test_r2_5:.4f}")
print(f"Training RMSE with 5 PCA components: {train_rmse_5:.4f}")
print(f"Testing RMSE with 5 PCA components: {test_rmse_5:.4f}")
Training R^2 with 5 PCA components: 0.9882
Testing R^2 with 5 PCA components: 0.9190
Training RMSE with 5 PCA components: 2.4613
Testing RMSE with 5 PCA components: 6.4684
In [45]:
# Let's plot R^2 for the train and test data for different numbers of components: 2, 3, 4, 5, 9
components = [2, 3, 4, 5, 9]
train_r2_values = [train_r2_2, train_r2_3, train_r2_4, train_r2_5, train_r2]
test_r2_values = [test_r2_2, test_r2_3, test_r2_4, test_r2_5, test_r2]
plt.figure(figsize=(10, 6))
plt.plot(components, train_r2_values, marker='o', label='Training R^2')
plt.plot(components, test_r2_values, marker='o', label='Testing R^2')
plt.xlabel('Number of Principal Components')
plt.ylabel('R^2 Score')
plt.title('R^2 Score for Different Number of Principal Components')
plt.legend()
plt.xticks(components)
plt.grid()
plt.show()
# The line plot shows the R^2 scores for both training and testing data as the number of principal components increases.
In [47]:
# PCA

# class sklearn.decomposition.PCA(n_components=None, *, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', n_oversamples=10, power_iteration_normalizer='auto', random_state=None)

# n_components: int, float or ‘mle’, default=None

# Lets apply PCA to capture 80% of variance
pca_80 = PCA(n_components=0.80)
X_train_pca_80 = pca_80.fit_transform(X_train)
X_test_pca_80 = pca_80.transform(X_test)

# Lets not visualize the data

# Lets plot the explained variance ratio for the PCA with 80% variance
plt.figure(figsize=(10, 6))
plt.bar(range(1, len(pca_80.explained_variance_ratio_) + 1), pca_80.explained_variance_ratio_, alpha=0.7, align='center')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.title('Explained Variance Ratio for PCA with 80% Variance')
plt.xticks(range(1, len(pca_80.explained_variance_ratio_) + 1))
plt.show()

# Lets visualize the cumulative explained variance for the PCA with 80% variance using plotly
cumulative_variance_80 = np.cumsum(pca_80.explained_variance_ratio_)
fig_80 = px.line(x=range(1, len(cumulative_variance_80) + 1), y=cumulative_variance_80, labels={'x': 'Principal Components', 'y': 'Cumulative Explained Variance'},
                title='Cumulative Explained Variance for PCA with 80% Variance')
fig_80.update_layout(xaxis=dict(tickmode='linear', tick0=1, dtick=1))
fig_80.show()
# The cumulative explained variance plot for PCA with 80% variance shows how much total variance is explained
# when considering the principal components that capture 80% of the variance, helping us understand
# the overall information captured by these components.

# Lets apply Random Forest Regressor on the PCA transformed data with 80% variance
model_80 = RandomForestRegressor()
# Fit the model on the training data with PCA components that capture 80% variance
model_80.fit(X_train_pca_80, y_train)

# Predict on the train and test data with PCA components that capture 80% variance
y_train_pred_80 = model_80.predict(X_train_pca_80)
y_test_pred_80 = model_80.predict(X_test_pca_80)

# Evaluate the model with PCA components that capture 80% variance - r2, RMSE
train_r2_80 = r2_score(y_train, y_train_pred_80)
test_r2_80 = r2_score(y_test, y_test_pred_80)
train_rmse_80 = np.sqrt(mean_squared_error(y_train, y_train_pred_80))
test_rmse_80 = np.sqrt(mean_squared_error(y_test, y_test_pred_80))
print(f"Training R^2 with PCA components that capture 80% variance: {train_r2_80:.4f}")
print(f"Testing R^2 with PCA components that capture 80% variance: {test_r2_80:.4f}")
print(f"Training RMSE with PCA components that capture 80% variance: {train_rmse_80:.4f}")
print(f"Testing RMSE with PCA components that capture 80% variance: {test_rmse_80:.4f}")
Training R^2 with PCA components that capture 80% variance: 0.9883
Testing R^2 with PCA components that capture 80% variance: 0.9192
Training RMSE with PCA components that capture 80% variance: 2.4549
Testing RMSE with PCA components that capture 80% variance: 6.4608
In [ ]:
 



t-SNE (t-distributed Stochastic Neighbor Embedding)¶


1. What is t-SNE?¶

t-SNE is a non-linear dimensionality reduction technique developed by Laurens van der Maaten and Geoffrey Hinton. It is mainly used for visualizing high-dimensional data in two or three dimensions.

Unlike PCA, which preserves global structure and variance, t-SNE focuses on preserving local structure — meaning it aims to keep similar points close together in the low-dimensional space.


2. Why Use t-SNE?¶

In many real-world datasets, the number of features (columns) is high. Visualizing such data becomes impossible. t-SNE helps to:

  • Reduce high-dimensional data to 2D or 3D for visual exploration.
  • Reveal hidden patterns, groupings, or clusters.
  • Understand relationships between data points without building a model.

3. Intuition Behind t-SNE¶

Here’s how t-SNE works in simplified terms:

a. High-Dimensional Similarities¶

  • t-SNE calculates the probability that two data points are similar based on their distances.
  • Nearby points have high probabilities; distant points have low probabilities.
  • These probabilities are computed using a Gaussian distribution in the high-dimensional space.

b. Low-Dimensional Similarities¶

  • In the target low-dimensional space (usually 2D), t-SNE tries to recreate a similar structure.
  • But instead of a Gaussian, it uses a Student's t-distribution (with 1 degree of freedom) to measure pairwise similarities.

c. Minimize Divergence¶

  • t-SNE minimizes the Kullback–Leibler (KL) divergence between the two probability distributions:

    • High-dimensional (input space)
    • Low-dimensional (visualization space)
  • The goal is to map similar points close together and dissimilar points far apart.
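
The three steps above can be sketched numerically. This toy example uses three made-up 1-D points and a hypothetical embedding, and fixes the Gaussian bandwidth at 1 (real t-SNE tunes it per point from the perplexity):

```python
import numpy as np

X = np.array([0.0, 0.1, 5.0])   # "high-dimensional" points (here 1-D, for illustration)
Y = np.array([0.0, 0.2, 4.0])   # a candidate low-dimensional embedding

def pairwise_sq_dists(Z):
    d = Z[:, None] - Z[None, :]
    return d ** 2

# (a) High-dimensional similarities: Gaussian kernel, normalized to probabilities
P = np.exp(-pairwise_sq_dists(X))
np.fill_diagonal(P, 0.0)
P /= P.sum()

# (b) Low-dimensional similarities: Student's t with 1 degree of freedom
Q = 1.0 / (1.0 + pairwise_sq_dists(Y))
np.fill_diagonal(Q, 0.0)
Q /= Q.sum()

# (c) KL divergence that t-SNE minimizes by moving the embedded points
mask = P > 0
kl = np.sum(P[mask] * np.log(P[mask] / Q[mask]))
print(f"KL(P || Q) = {kl:.4f}")
```

A good embedding drives this KL value toward zero; t-SNE does so by gradient descent on the positions in Y.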


4. Key Parameters in t-SNE¶

| Parameter | Description | Tips |
|---|---|---|
| n_components | Number of output dimensions (usually 2 or 3) | Use 2D for plots |
| perplexity | Balance between local and global structure (roughly the number of neighbors to consider) | Typical range: 5 to 50 |
| learning_rate | Step size for optimization | Range: 10 to 1000; too low/high may fail |
| n_iter | Number of optimization iterations | At least 250; 1000+ recommended |
| random_state | Ensures reproducibility | Use a fixed seed like 42 |
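
These parameters map onto scikit-learn's TSNE constructor. A minimal sketch on synthetic data (the shapes and seed are arbitrary):

```python
from sklearn.manifold import TSNE
import numpy as np

# Synthetic data just to show the call signature
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))  # 200 samples, 10 features

tsne = TSNE(
    n_components=2,       # 2-D output for plotting
    perplexity=30,        # roughly the number of neighbors considered
    learning_rate=200.0,  # step size; 10-1000 is the usual range
    random_state=42,      # fixed seed for reproducibility
)
X_2d = tsne.fit_transform(X)
print(X_2d.shape)  # (200, 2)
```

Note that perplexity must be smaller than the number of samples, and there is no `transform` method for new data: t-SNE must be refit from scratch.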

5. Example Use Cases¶

  • Visualizing word embeddings (like Word2Vec or GloVe)
  • Customer segmentation in marketing
  • Clustering of gene expression profiles
  • Analyzing MNIST handwritten digits
  • Understanding embeddings from deep learning models

6. What Does a t-SNE Plot Show?¶

  • Each point in the plot represents a high-dimensional data point (e.g., a row in your dataset).
  • Points that appear close together in the plot were similar in the original high-dimensional space.
  • If you color points by class label (if available), you'll often see natural clustering or class separation.

7. Example Interpretation¶

Consider a t-SNE plot of handwritten digits:

  • Cluster of “3”s appears in one region.
  • Cluster of “8”s is nearby but separate.
  • Some overlapping may exist if digits are visually similar.

This indicates that the digit embeddings are separable based on their latent features, and classes are locally grouped.


8. Limitations of t-SNE¶

| Limitation | Explanation |
|---|---|
| Not deterministic | Can yield different plots each run (unless random_state is fixed) |
| No inverse transform | You can't reconstruct original data from t-SNE outputs |
| Only visualization | t-SNE is not meant for preprocessing before modeling |
| Computationally intensive | Slow for large datasets |
| Misleading global structure | t-SNE preserves local structure well, not global distances |

9. Best Practices¶

  • Standardize the data before applying t-SNE.
  • Try different perplexity values to find stable patterns.
  • Use color coding by labels to understand clusters better.
  • Avoid using t-SNE for preprocessing before model training.
  • Don’t over-interpret distances between far-apart clusters.

10. When to Use t-SNE?¶

| Scenario | Use t-SNE? |
|---|---|
| You want to visualize high-dimensional data | Yes |
| You need to find natural clusters | Yes |
| You need a model-ready reduced dataset | No (use PCA instead) |
| You want to reverse-transform to original space | No (not supported) |
| Your dataset has >10,000 samples | Use with caution; may be slow |

11. Alternatives to t-SNE¶

| Method | When to Use |
|---|---|
| PCA | When variance explanation and interpretability are important |
| UMAP | Faster, more scalable than t-SNE; preserves more global structure |
| Autoencoders | For non-linear dimensionality reduction with reconstruction |
| ISOMAP | For manifold learning where global geometry matters |

In [50]:
X.head()
Out[50]:
timestamp day_of_week is_weekend is_holiday temperature is_start_of_semester is_during_semester month hour
0 61211 4 0 0 22.088889 0 0 8 17
1 62414 4 0 0 22.088889 0 0 8 17
2 63015 4 0 0 22.088889 0 0 8 17
3 63616 4 0 0 22.088889 0 0 8 17
4 64217 4 0 0 22.088889 0 0 8 17
In [51]:
y.head()
Out[51]:
0    37
1    45
2    40
3    44
4    45
Name: number_people, dtype: int64
In [52]:
# Scaling the features
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Scaling the features is important for PCA as it is sensitive to the variances of the features.
In [53]:
# Apply t-SNE for visualization on X_scaled
from sklearn.manifold import TSNE

# Fit and transform the scaled data using t-SNE
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)

# Visualize the t-SNE results
plt.figure(figsize=(10, 6))
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis', edgecolor='k', s=50)
plt.colorbar(label='Number of People')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.title('t-SNE Visualization of the Scaled Data')
plt.show()
# The t-SNE plot shows the distribution of the data in a 2D space,
# where each point represents a sample, and the color indicates the number of people present in the gym.
# t-SNE is particularly useful for visualizing high-dimensional data in a lower-dimensional space
# while preserving the local structure of the data.

The t-SNE plot provides a meaningful 2D representation of the gym crowdedness dataset. The analysis reveals the following insights:

  1. High-dimensional data was effectively reduced to two dimensions with t-SNE (applied to the standardized features), enabling visual exploration of complex patterns.

  2. Color intensity indicates the number of people present in the gym.

    • Darker points represent lower crowd levels.
    • Brighter yellow/green points represent higher crowd levels.
  3. Cluster patterns are clearly visible, indicating that certain combinations of features (e.g., time, day, conditions) correspond to distinct crowd behavior profiles.

  4. Transitional regions suggest gradual changes in crowdedness, likely due to overlapping behavioral patterns or moderate usage hours.

  5. This visualization can support:

    • Identifying peak vs. off-peak periods
    • Segmenting gym users by behavioral trends
    • Strategic resource planning (staffing, facility usage) based on crowd clusters

Overall, the t-SNE visualization provides actionable insight into gym usage dynamics and highlights natural groupings in the data that can inform operational decisions.


Key Takeaways:¶

  1. Each dot in the plot is a snapshot (e.g., a specific time or day) when people were present in the gym.

  2. The color of the dot shows how crowded the gym was at that moment:

    • Darker purple means fewer people.
    • Brighter yellow means more people.
  3. The dots are grouped based on similar usage patterns.

    • Areas where dots are close together mean similar crowd levels and behavior.
    • Spread-out areas indicate different or unusual crowd levels.
  4. Clusters of similar colors show that the gym tends to have consistent crowd levels at certain times or under certain conditions (for example, weekday evenings might form a high-crowd cluster).


In [ ]:
 

Linear Discriminant Analysis (LDA)¶


1. What is LDA?¶

LDA (Linear Discriminant Analysis) is a supervised dimensionality reduction technique used primarily for classification tasks.

  • Unlike PCA, which maximizes variance without considering class labels, LDA seeks to maximize class separability.
  • LDA projects data onto a lower-dimensional space where the classes are most distinguishable.

2. Why Use LDA?¶

| Reason | Explanation |
|---|---|
| Classification-Focused | LDA improves separation between classes using label (y) information. |
| Reduce Dimensionality | Just like PCA, it reduces input features to fewer linear combinations. |
| Improve Model Accuracy | Enhances performance of classifiers by simplifying the feature space. |
| Better Visualization | Enables 2D or 3D visual analysis of multi-class problems. |

3. How Does LDA Work? (Step-by-Step)¶

  1. Compute Class-wise Mean Vectors

    • Calculate the mean vector for each class in the dataset.
  2. Compute Scatter Matrices

    • Within-class scatter matrix (SW): How data points in a class vary among themselves.
    • Between-class scatter matrix (SB): How class means vary from the overall mean.
  3. Solve the Generalized Eigenvalue Problem

    • Solve the matrix equation to find eigenvectors and eigenvalues for inv(SW) * SB.
  4. Select Linear Discriminants

    • Rank eigenvectors by eigenvalues.
    • Select top k eigenvectors to form the LDA projection matrix.
  5. Project Data

    • Multiply original data with the LDA matrix to transform into lower dimensions.
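
The five steps above can be sketched with NumPy on the iris data. This is a bare-bones illustration of the classical derivation, not scikit-learn's implementation (which uses more numerically robust solvers):

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
n_features = X.shape[1]
overall_mean = X.mean(axis=0)

# 1-2. Class-wise means, within-class (SW) and between-class (SB) scatter
SW = np.zeros((n_features, n_features))
SB = np.zeros((n_features, n_features))
for c in np.unique(y):
    Xc = X[y == c]
    mean_c = Xc.mean(axis=0)
    SW += (Xc - mean_c).T @ (Xc - mean_c)
    diff = (mean_c - overall_mean).reshape(-1, 1)
    SB += len(Xc) * diff @ diff.T

# 3. Eigendecomposition of inv(SW) @ SB
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(SW) @ SB)

# 4. Keep the top-k eigenvectors (at most n_classes - 1 = 2 are meaningful)
order = np.argsort(eigvals.real)[::-1]
W = eigvecs[:, order[:2]].real

# 5. Project the data into the 2-D discriminant space
X_lda = X @ W
print(X_lda.shape)  # (150, 2)
```

With 3 classes, only 2 eigenvalues are non-negligible, which is why LDA is capped at n_classes - 1 components.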

4. Key Differences: LDA vs PCA¶

| Feature | PCA | LDA |
|---|---|---|
| Supervision | Unsupervised (ignores y) | Supervised (uses y) |
| Goal | Maximize variance | Maximize class separation |
| Components | At most min(n_samples, n_features) | At most n_classes - 1 |
| Output | Principal Components | Linear Discriminants |
| Assumed Distribution | No assumption | Gaussian class distribution |

5. Important Pointers¶

| Concept | Details |
|---|---|
| Uses y (labels)? | Yes, LDA is supervised. |
| Max Dimensions | Limited to n_classes - 1 components. |
| Gaussian Assumption | Assumes each class is normally distributed with the same covariance. |
| Linearity | LDA creates linear decision boundaries. |
| Transformation | Linear projection just like PCA, but class-aware. |

6. Advantages of LDA¶

  • Improves classification by focusing on class separability.
  • Reduces dimensionality and noise.
  • Produces interpretable projections aligned with class structure.
  • Often improves performance of classifiers like logistic regression, SVM, and Naive Bayes.

7. Disadvantages of LDA¶

  • Assumes normal distribution and equal covariance across classes — often unrealistic in real-world data.
  • Works poorly if classes are not linearly separable.
  • Struggles with imbalanced datasets — may bias towards majority class.
  • Limited to n_classes - 1 features — may not reduce enough dimensions in multi-class problems.

8. Corner Cases / Pitfalls¶

| Case | Problem |
|---|---|
| Non-Gaussian Features | LDA may misrepresent class separability. |
| Highly Imbalanced Classes | Between-class variance may get distorted. |
| Too Few Samples | Leads to an unstable covariance matrix (especially in high-dimensional settings). |
| Missing Values | Must be handled before applying LDA. |
| Heteroscedasticity | Unequal variances across classes violate assumptions. |
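
For the "Too Few Samples" case, scikit-learn's lsqr and eigen solvers accept a shrinkage option that regularizes the covariance estimate. A minimal sketch on synthetic data (the shapes and labels are made up for illustration):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Few samples relative to features: the empirical covariance is unstable
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 50))           # 30 samples, 50 features
y = rng.integers(0, 2, size=30)         # two synthetic classes

# 'auto' picks the shrinkage amount via the Ledoit-Wolf lemma
lda = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")
lda.fit(X, y)
print(lda.predict(X[:5]).shape)  # (5,)
```

Note that the lsqr solver supports classification only (no `transform`); use solver="eigen" with shrinkage if you also need the projection.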

9. When to Use LDA?¶

| Scenario | Use LDA? | Notes |
|---|---|---|
| You want to reduce dimensions and improve classification | Yes | LDA performs well when assumptions are roughly met. |
| Classes are linearly separable | Yes | LDA finds optimal projection directions. |
| Need 2D visualization of multi-class data | Yes | LDA offers meaningful views of class structure. |
| High-dimensional dataset with few classes | Yes | LDA is ideal when n_classes << n_features. |
| Target variable not available | No | LDA cannot be used without y. |
| Non-linear class boundaries | No | Try kernel LDA or t-SNE instead. |

10. Best Practices¶

  • Standardize features if they differ in scale.
  • Handle missing data before applying LDA.
  • Use with classification models like SVM, logistic regression, or Naive Bayes.
  • Evaluate LDA assumptions: normality and equal covariances (optional, but recommended).
  • Plot explained variance ratio to choose number of discriminants.
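
For the last point, scikit-learn's LinearDiscriminantAnalysis exposes an explained_variance_ratio_ attribute. A minimal sketch on the iris data:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis(n_components=2)
lda.fit(X, y)

# Fraction of between-class variance captured by each linear discriminant
print(lda.explained_variance_ratio_)
```

On iris the first discriminant dominates, so even a 1-D projection separates the classes well.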

11. Example: LDA for Classification¶

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

# Scale data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit LDA
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X_scaled, y)

# Train a classifier on X_lda (defining a classifier here for illustration)
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X_lda, y)

12. Alternative Techniques¶

| Method | Use When |
|---|---|
| PCA | You only want variance capture, not class separation |
| Kernel LDA | Non-linear class separability |
| t-SNE / UMAP | Non-linear data visualization |
| Feature Selection | You want to retain original feature meaning |

In [56]:
# Let's apply LDA on the iris dataset from seaborn
import seaborn as sns
iris = sns.load_dataset('iris')
X = iris.drop('species', axis=1)
y = iris['species']

# Train-Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling the features
features = X_train.columns.tolist()
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train[features] = sc.fit_transform(X_train[features])
X_test[features] = sc.transform(X_test[features])

# Apply LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(n_components=2)  # n_components can be at most n_classes - 1 (here 3 - 1 = 2)
X_train_lda = lda.fit_transform(X_train, y_train)
X_test_lda = lda.transform(X_test)

# Visualize the LDA results
plt.figure(figsize=(10, 6))
sns.scatterplot(x=X_train_lda[:, 0], y=X_train_lda[:, 1], hue=y_train, palette='viridis', edgecolor='k', s=100)
plt.xlabel('LDA Component 1')
plt.ylabel('LDA Component 2')
plt.title('LDA Visualization of Iris Dataset')
plt.legend(title='Species')
plt.show()
# The LDA plot shows the distribution of the training data in the new LDA space,
# where each point represents a sample, and the color indicates the species of the iris flower.

# LDA is particularly useful for classification tasks as it maximizes the separation between classes while minimizing the variance within each class.

# Evaluate the LDA model
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Fit a classifier (e.g., Random Forest) on the LDA-transformed data
from sklearn.ensemble import RandomForestClassifier
model_lda = RandomForestClassifier(random_state=42)
model_lda.fit(X_train_lda, y_train)

# Predict on the train and test data
y_train_pred_lda = model_lda.predict(X_train_lda)
y_test_pred_lda = model_lda.predict(X_test_lda)

# Evaluate the model - accuracy, classification report, confusion matrix
train_accuracy_lda = accuracy_score(y_train, y_train_pred_lda)
test_accuracy_lda = accuracy_score(y_test, y_test_pred_lda)
train_classification_report_lda = classification_report(y_train, y_train_pred_lda)
test_classification_report_lda = classification_report(y_test, y_test_pred_lda)
train_confusion_matrix_lda = confusion_matrix(y_train, y_train_pred_lda)
test_confusion_matrix_lda = confusion_matrix(y_test, y_test_pred_lda)
print(f"Training Accuracy with LDA: {train_accuracy_lda:.4f}")
print(f"Testing Accuracy with LDA: {test_accuracy_lda:.4f}")
print("Training Classification Report with LDA:\n", train_classification_report_lda)
print("Testing Classification Report with LDA:\n", test_classification_report_lda)
print("Training Confusion Matrix with LDA:\n", train_confusion_matrix_lda)
print("Testing Confusion Matrix with LDA:\n", test_confusion_matrix_lda)
# The accuracy scores, classification reports, and confusion matrices provide insights into the model's performance on both the training and testing datasets.
# The classification report includes precision, recall, and F1-score for each class, while the confusion matrix shows the number of correct and incorrect predictions for each class.
Training Accuracy with LDA: 1.0000
Testing Accuracy with LDA: 1.0000
Training Classification Report with LDA:
               precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        40
  versicolor       1.00      1.00      1.00        41
   virginica       1.00      1.00      1.00        39

    accuracy                           1.00       120
   macro avg       1.00      1.00      1.00       120
weighted avg       1.00      1.00      1.00       120

Testing Classification Report with LDA:
               precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

Training Confusion Matrix with LDA:
 [[40  0  0]
 [ 0 41  0]
 [ 0  0 39]]
Testing Confusion Matrix with LDA:
 [[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]
In [ ]:
 

Dimensionality Reduction Using PCA¶

In [57]:
# Dimensionality Reduction Using PCA

value = pd.read_csv(
    filepath_or_buffer="https://raw.githubusercontent.com/insaid2018/pca-file/master/train.csv")
print('Shape of the dataset:', value.shape)
value.head()
# https://raw.githubusercontent.com/insaid2018/pca-file/master/train.csv
Shape of the dataset: (4459, 4993)
Out[57]:
ID target 48df886f9 0deb4b6a8 34b15f335 a8cb14b00 2f0771a37 30347e683 d08d1fbe3 6ee66e115 ... 3ecc09859 9281abeea 8675bec0b 3a13ed79a f677d4d13 71b203550 137efaa80 fb36b89d9 7e293fbaf 9fc776466
0 000d6aaf2 38000000.0 0.0 0 0.0 0 0 0 0 0 ... 0.0 0.0 0.0 0 0 0 0 0 0 0
1 000fbd867 600000.0 0.0 0 0.0 0 0 0 0 0 ... 0.0 0.0 0.0 0 0 0 0 0 0 0
2 0027d6b71 10000000.0 0.0 0 0.0 0 0 0 0 0 ... 0.0 0.0 0.0 0 0 0 0 0 0 0
3 0028cbf45 2000000.0 0.0 0 0.0 0 0 0 0 0 ... 0.0 0.0 0.0 0 0 0 0 0 0 0
4 002a68644 14400000.0 0.0 0 0.0 0 0 0 0 0 ... 0.0 0.0 0.0 0 0 0 0 0 0 0

5 rows × 4993 columns

Observations:

  • We are provided with an anonymized dataset.

  • The dataset contains 4459 observations and 4993 columns.

  • The target feature is numeric and has an average value of 5944923 units.

  • It ranges from 300000 units all the way up to 40000000 units.

In [58]:
X = value.drop(labels=['target'], axis=1)
y = value['target']
In [59]:
# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.20, 
                                                    random_state=42)

# Display the shape of training and testing data
print('X_train shape: ', X_train.shape)
print('y_train shape: ', y_train.shape)
print('X_test shape: ', X_test.shape)
print('y_test shape: ', y_test.shape)
X_train shape:  (3567, 4992)
y_train shape:  (3567,)
X_test shape:  (892, 4992)
y_test shape:  (892,)
In [65]:
# Filter all the columns of dtype float64 and int64
X_train = X_train.select_dtypes(include=['float64', 'int64'])
X_test = X_test.select_dtypes(include=['float64', 'int64'])
In [66]:
# Instantiating a standard scaler object
scaler = StandardScaler()

# Transforming our data
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
In [67]:
# Applying PCA to capture 80% of variance
pca = PCA(n_components=0.80)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

# Display the shape of PCA transformed data
print('X_train_pca shape: ', X_train_pca.shape)
print('X_test_pca shape: ', X_test_pca.shape)

# Visualizing the explained variance ratio
plt.figure(figsize=(10, 6))
plt.bar(range(1, len(pca.explained_variance_ratio_) + 1), pca.explained_variance_ratio_, alpha=0.7, align='center')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.title('Explained Variance Ratio for PCA with 80% Variance')
plt.xticks(range(1, len(pca.explained_variance_ratio_) + 1))
plt.show()

# Visualizing the cumulative explained variance for PCA with 80% variance using plotly
cumulative_variance_pca = np.cumsum(pca.explained_variance_ratio_)
fig_pca = px.line(x=range(1, len(cumulative_variance_pca) + 1), y=cumulative_variance_pca, 
                  labels={'x': 'Principal Components', 'y': 'Cumulative Explained Variance'},
                  title='Cumulative Explained Variance for PCA with 80% Variance')
fig_pca.update_layout(xaxis=dict(tickmode='linear', tick0=1, dtick=1))
fig_pca.show()

# Applying Random Forest Regressor on the PCA transformed data with 80% variance
model_pca = RandomForestRegressor(random_state=42)
# Fit the model on the training data with PCA components that capture 80% variance
model_pca.fit(X_train_pca, y_train)

# Predict on the train and test data with PCA components that capture 80% variance
y_train_pred_pca = model_pca.predict(X_train_pca)
y_test_pred_pca = model_pca.predict(X_test_pca)

# Evaluate the model with PCA components that capture 80% variance - r2, RMSE
train_r2_pca = r2_score(y_train, y_train_pred_pca)
test_r2_pca = r2_score(y_test, y_test_pred_pca)
train_rmse_pca = np.sqrt(mean_squared_error(y_train, y_train_pred_pca))
test_rmse_pca = np.sqrt(mean_squared_error(y_test, y_test_pred_pca))
print(f"Training R^2 with PCA components that capture 80% variance: {train_r2_pca:.4f}")
print(f"Testing R^2 with PCA components that capture 80% variance: {test_r2_pca:.4f}")
print(f"Training RMSE with PCA components that capture 80% variance: {train_rmse_pca:.4f}")
print(f"Testing RMSE with PCA components that capture 80% variance: {test_rmse_pca:.4f}")
X_train_pca shape:  (3567, 720)
X_test_pca shape:  (892, 720)
Training R^2 with PCA components that capture 80% variance: 0.8825
Testing R^2 with PCA components that capture 80% variance: 0.0333
Training RMSE with PCA components that capture 80% variance: 2878888.3490
Testing RMSE with PCA components that capture 80% variance: 7404056.9097

Happy Learning¶